HTMSTRIP.DOC 1 Revised: 04/13/96
The HTMSTRIP.EXE program attempts to read HTML pages, remove the HTML coding,
and write the file out as something more useful. Features of this program:
* Can be run across an entire subdirectory (for example, your entire
cache subdirectory), and will only process the HTML documents that it finds.
(There are some options on this.)
* Removes all embedded HTML commands.
* Recodes the standard HTML "entity references" (e.g. "&copy;" becomes
"(c)"). The actual replacements are coded in a user-modifiable lookup file.
* Handles standard indent, heading, selection groups, menus, tables, etc.
* Reflows all text as appropriate
* Optionally, will replace Link, Image, and Input references with
user-definable text representations.
* Optionally, alerts you to possible errors in the HTML code itself.
HTML codes are surrounded within <...> indicators. For upward compatibility
reasons, Web browsers ignore any codes that they don't understand and just
process the ones they can handle.
Note that the HTMSTRIP command is currently geared for handling HTML 2.0 files
and the Netscape table-specific extensions (added to HTML 3.0).
HTMSTRIP removes all HTML codes. It also handles the standard HTML "&xxx;"
"entity references" (e.g. "&copy;" is replaced by "(c)"). You can add or change
these replacements as desired by using the INI file (documented later).
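The basic idea described above can be sketched in a few lines of Python. This is only an illustration, not how HTMSTRIP itself is written: the function name "strip_html" and the tiny ENTITIES table are made up for the example, and the real lookups come from the INI file.

```python
import re

# Hypothetical lookup table for illustration; the real one lives in HTMSTRIP.INI.
ENTITIES = {"&copy;": "(c)", "&amp;": "&", "&lt;": "<", "&gt;": ">"}

def strip_html(text, entities=ENTITIES):
    """Drop every <...> code, then recode &xxx; entity references."""
    no_tags = re.sub(r"<[^>]*>", "", text)
    # References that are not in the table are left untouched.
    return re.sub(r"&#?\w+;",
                  lambda m: entities.get(m.group(0), m.group(0)),
                  no_tags)

print(strip_html("<B>&copy; 1996</B> Wayne Software"))
```

Tags are removed before entities are recoded so that a recoded "&lt;" is not mistaken for the start of a tag.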
HTMSTRIP is also tuned to allow it to specially-handle several embedded HTML
codes. These codes are the following:
<A ...>                         External link
<BLOCKQUOTE>...</BLOCKQUOTE>    Indented block of text
<BR>                            Forced line break
<CAPTION>...</CAPTION>          Title for a table
<CENTER>...</CENTER>            Centering text
<DD>                            Term definition
<DIR>...</DIR>                  Directory list of items
</DL>                           End of definition list
<DT>                            First term of definition list/glossary
<H1> to <H6>...</H1> to </H6>   Heading items
<HR>                            Horizontal rule
<IMG ...>                       Image
<INPUT ...>                     User input
<LI>                            Menu/Ordered/Unordered/Directory list item
<MENU>...</MENU>                Menu listing
<OL>...</OL>                    Ordered listing
<OPTION>                        Used for single/multiple choice menus
<P>                             Paragraph indicator
<PRE>...</PRE>                  Preserve spacing block (preformatted text)
<SELECT>...</SELECT>            Block for single/multiple choice menu
<TABLE>...</TABLE>              Table block
<TD>...</TD>                    Table data (cell)
<TH>...</TH>                    Table heading
<TITLE>...</TITLE>              Title item
<TR>...</TR>                    Table row
<UL>...</UL>                    Unordered listing
If you run across other codes that become vital, let me know and I'll try to
handle them somehow.
How to get HTML files:
Some people who are using regular Web browsers like Mosaic or Netscape don't
realize that they're automatically saving HTML files to their hard disk
throughout every Web session. That's because just about every Web browser saves
the most-recently accessed files from the Web (including HTML source code,
GIF's, and JPG's) on your hard disk and reads them from there instead of
requiring you to download them every time you go back to a previous page. This
is typically settable by you under "Preferences" and "Cache" on your Web
browser.
I usually set my Web browser to have a huge cache, maybe 10MB. Anything beats
downloading the same pages over again even at 28.8K. And I make sure that I do
not have anything specified like "clear cache at the end of every session". Then
I just go through the files in the cache subdirectory afterward and reprocess
them.
Two disadvantages to a cache... It takes up hard disk space but, hey, the Web
browser is typically in Windows so why are you surprised. The second
disadvantage is that if the page actually changes between sessions, you
typically won't notice the new page as long as it remains in your cache. If you
think a page should have changed but you are still seeing the cached copy, you can
typically ask your Web browser to reload the page. On some browsers, this is
shown as an arrow in the form of a circle.
HTMSTRIP can process the entire cache subdirectory. It automatically detects
non-HTML files for you and processes accordingly. It creates new text file
versions of just the HTML pages it finds.
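How HTMSTRIP decides that a cache file is HTML is not documented here. One plausible heuristic, shown purely as an assumption, is to look for a few well-known tags near the start of the file and to reject anything containing binary zero bytes (as GIF and JPG files usually do). The name "looks_like_html" and the 2048-byte window are invented for the sketch.

```python
import re

# Assumed heuristic: a handful of common tags near the top of the file.
HTML_HINT = re.compile(rb"<(html|head|title|body|h[1-6])\b", re.I)

def looks_like_html(data: bytes) -> bool:
    """Guess whether raw file contents are an HTML page."""
    head = data[:2048]              # only sniff the beginning
    if b"\x00" in head:             # zero bytes suggest a binary file
        return False
    return HTML_HINT.search(head) is not None
```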
By the way, the current beta version of Netscape typically ignores my cache
setting. I don't have the slightest idea why.
As a result, when you Alt-F4 out of Netscape, it goes through and deletes all
but a few of the temporary files. This is annoying to say the least. As a
result, I have to run HTMSTRIP from a DOS window just before leaving Netscape.
If anyone knows why it does this to me, please let me know!
Specifying parameters:
Parameters for this program can be set in the following ways. The last setting
encountered always wins:
- Read from an *.INI file (see BRUCEINI.DOC file),
- Through the use of an environmental variable (SET HTMSTRIP=whatever), or
- From the command line (see "Syntax" below)
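The "last setting wins" rule can be pictured as a simple merge. The sketch below assumes the three sources are read in the order listed above (INI file, then environment variable, then command line) and that each has already been parsed into a dictionary; the function name and the dictionary keys are hypothetical.

```python
def resolve_settings(ini, env, cmdline):
    """Merge three setting sources; later sources override earlier ones."""
    # Initial defaults taken from the Syntax section of this document.
    settings = {"WIDTH": "80", "EXT": ".OUT"}
    for source in (ini, env, cmdline):   # assumed reading order
        settings.update(source)
    return settings

print(resolve_settings({"WIDTH": "72"}, {"WIDTH": "100"}, {"EXT": ".TXT"}))
```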
Defining entity references:
HTMSTRIP will process an entity reference definition file if one is found. This
table can be in your standard *.INI file (e.g. HTMSTRIP.INI) if desired or it
can be a separate file specified using the /Linitfile parameter.
Entity references are how non-standard characters like the copyright character
are handled in HTML pages. Entity references are indicated as "&xxx;" where
"xxx" is either a code or a number preceded by a pound sign. The copyright
symbol is indicated in HTML as "&copy;".
A default HTMSTRIP.INI is provided with over 230 entity reference lookups. To
define or change these lookups, the INI file should include a series of lines in
the following format:
&xxx; = outstr
where "&xxx;" is the HTML sequence and "outstr" is what you want to replace it
with. The "outstr" portion can consist of regular non-space ASCII text
characters (like "A" or "z") as well as hexadecimal values (in the form &Hxx) or
decimal values (in the form \nnn). (See the BRUCEHEX.DOC file.) It can also be
the word "NULL" which translates the string into nothing. You cannot use a
space or equal sign in "outstr"; use the hexadecimal or decimal representations
instead. The table does not have to be in any specified order. Lines can end
with "/*" followed by a comment if you want. Examples:
&copy; = (c) /* Copyright symbol
&deg; = °
&eacute; = é
&ecirc; = ê
&egrave; = è
&nbsp; = \032
Remember that "&xxx;" entity references (yes, I hate that phrase) are
case-sensitive in HTML. "&deg;" will not find "&Deg;".
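For anyone writing their own tools around the lookup file, the line format above could be read roughly like this. This is a sketch under assumptions: &Hxx is taken to be exactly two hex digits and \nnn exactly three decimal digits (BRUCEHEX.DOC is the real authority), and chr() yields Unicode code points rather than the DOS character set the program actually writes. The function names are hypothetical.

```python
import re

def decode_outstr(outstr):
    """Expand &Hxx (hex) and backslash-nnn (decimal) codes; NULL means empty."""
    if outstr == "NULL":
        return ""
    def expand(m):
        if m.group(1) is not None:
            return chr(int(m.group(1), 16))   # &Hxx form
        return chr(int(m.group(2)))           # \nnn form
    return re.sub(r"&H([0-9A-Fa-f]{2})|\\(\d{3})", expand, outstr)

def parse_lookup_line(line):
    """Return (key, replacement), or None if the line is not a lookup."""
    line = line.split("/*", 1)[0]             # drop a trailing comment
    key, sep, outstr = line.partition("=")
    if not sep or not key.strip():
        return None
    # The key is kept exactly as written because lookups are case-sensitive.
    return key.strip(), decode_outstr(outstr.strip())

print(parse_lookup_line("&copy; = (c) /* Copyright symbol"))
```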
There seems to be a trend of late to relax some of the replacement coding
requirements in Web pages. The ";" is now, apparently, becoming optional.
Numeric replacements (e.g. " ") seem to no longer require the leading pound
sign. Therefore, HTMSTRIP looks for both of these variations for any
appropriate lookup. "&copy;" will find "&copy" and "&#153;" will find "&153".
The lookup itself has to be entered in the formally correct way though.
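One way to picture the relaxed matching: each formally correct lookup key stands for a small set of spellings that may appear in a page. The helper below is a hypothetical illustration of that idea, not the program's actual code.

```python
def variants(key):
    """Spellings in a page that a formally correct lookup key should match."""
    body = key[1:-1]                 # strip the leading "&" and trailing ";"
    forms = {body}
    if body.startswith("#"):
        forms.add(body[1:])          # numeric form without the pound sign
    found = set()
    for form in forms:
        found.add("&" + form + ";")  # with the semicolon
        found.add("&" + form)        # semicolon left off
    return found

print(sorted(variants("&#153;")))
```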
You are also allowed to redefine the strings that are used for three symbolic
references in the file. These show up only if /SYMBOLS is specified. By
default, you will see the following:
for <A> external links -> (link)
for <IMG> image references -> (image)
for <INPUT> user inputs -> [Input]
You can redefine any and all of these symbolic references in the same lookup file.
These substitutions are specified more or less like the previous substitutions:
<A> = (link)
<IMG> = (image)
<INPUT> = [Input]
Unlike with the other lookups, the left side is not case sensitive so
"<a>=(link)" works just fine. Hexadecimal and decimal replacements are again
acceptable (see BRUCEHEX.DOC file). You might, for example, want to redefine
some of them like this:
<A> = \251 /* Replaces with a √ symbol
<IMG> = \015 /* Replaces with a ☼ symbol (little flash cube)
<INPUT> = ? /* Replaces with a question mark
Any symbolic references that you do not redefine will default to their original
values. If /-SYMBOLS is specified, any symbolic definitions are ignored and a
"NULL" replacement string is used for all of them.
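The rules for the three symbolic references (defaults, case-insensitive overrides, and /-SYMBOLS blanking everything) can be summarized in a short sketch. The function name "symbol_table" and its arguments are assumptions made for the example.

```python
DEFAULT_SYMBOLS = {"<A>": "(link)", "<IMG>": "(image)", "<INPUT>": "[Input]"}

def symbol_table(overrides, symbols_on):
    """Build the replacement text for <A>, <IMG>, and <INPUT>."""
    if not symbols_on:                       # /-SYMBOLS: everything is NULL
        return {tag: "" for tag in DEFAULT_SYMBOLS}
    table = dict(DEFAULT_SYMBOLS)            # anything not redefined keeps its default
    for tag, text in overrides.items():
        if tag.upper() in table:             # left side is not case sensitive
            table[tag.upper()] = text
    return table

print(symbol_table({"<a>": "?"}, symbols_on=True))
```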
Syntax:
HTMSTRIP { filespec | @listfile } [ outfile ] [ /EXT=.xxx ]
[ /WIDTH=n ] [ /SYMBOLS | /-SYMBOLS ] [ /ALL ] [ /SITE | /FSITE | /-SITE ]
[ /ALT ] [ /SPACES | /-SPACES ] [ /WARNINGS | /-WARNINGS ]
[ /RULE=s ] [ /BORDER=c ] [ /BUFF=n ]
[ /Iinitfile | /-I ] [ /Linitfile ] [ /? ] [ /?&H ]
where:
"filespec" tells the routine which file or files are to be processed. The
specification can include path and wildcards if desired. Typically, the file
names are *.HTM files.
"@listfile" allows you to have a variety of file specifications saved in a text
file named "listfile". Each line in the file should consist of one file
specification, each of which can include a path and wildcards if desired. Blank
lines and lines beginning with semi-colons, colons, or quotes are ignored.
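As an illustration of the @listfile rules, the following sketch filters the lines of such a file. It assumes "quotes" covers both single and double quote characters; the function name is made up.

```python
def read_listfile_lines(lines):
    """Keep file specifications; skip blank lines and comment-style lines."""
    specs = []
    for line in lines:
        spec = line.strip()
        # Assumption: both ' and " count as "quotes".
        if not spec or spec[0] in ";:\"'":
            continue
        specs.append(spec)
    return specs

print(read_listfile_lines(["; my cache", "", "C:\\CACHE\\*.HTM", "A.HTM\n"]))
```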
"outfile" is the name of the output file to create. Is overwritten if it exists
already. If no output file name is provided, the routine will use the infile
and provide an extension of *.OUT. (The default .OUT extension can be
overridden using the /EXT=.xxx parameter.) An outfile cannot be specified if
wildcards or @listfile are used for the input file specification.
"/EXT=.xxx" allows you to specify a different default file extension for the
output file. This parameter only matters if you do not explicitly specify an
output file name. Initially defaults to "/EXT=.OUT".
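The default output name could be derived as below. The sketch assumes the new extension replaces the input file's extension (which DOS 8.3 names more or less force); "default_outfile" is a hypothetical name.

```python
import ntpath   # DOS/Windows path rules, available on any platform

def default_outfile(infile, ext=".OUT"):
    """Swap the input file's extension for the /EXT value."""
    root, _old_ext = ntpath.splitext(infile)
    return root + ext

print(default_outfile("C:\\CACHE\\PAGE.HTM"))
print(default_outfile("INDEX.HTM", ".TXT"))
```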
"/WIDTH=n" specifies the desired line length for wrapping long lines and also
for centering. Initially defaults to "/WIDTH=80".
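To show what the width setting controls, here is a small sketch of reflowing and centering using Python's standard textwrap module. It is not the program's own algorithm, and the function names are invented.

```python
import textwrap

def reflow(paragraph, width=80):
    """Collapse runs of whitespace, then wrap at the given line length."""
    return textwrap.fill(" ".join(paragraph.split()), width=width)

def center_line(text, width=80):
    """Center text within the line; trailing blanks are dropped."""
    return text.center(width).rstrip()

print(reflow("Some   text\nbroken across\nseveral lines.", width=20))
print(center_line("A Heading", width=20))
```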
"/SYMBOLS" says to allow (unless redefined in your INI file) the "(link)",
"(image)", and "[Input]" indicators. Initially defaults to "/-SYMBOLS".
"/-SYMBOLS" skips the indicators even if they're defined in your INI file. This
is initially the default.
"/ALL" says that if the program encounters what it thinks is just a text file,
it should take the file and try to fix up CR/LF problems (Unix files end with
LF's instead of CR/LF which is what DOS needs) and that's it. This can be
somewhat risky since it may misdiagnose a file but it should be safe if you're
running it on your cache subdirectory. Initially defaults to "/-ALL" which
means it won't process it unless it thinks it's an HTML file.
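The CR/LF fix-up that /ALL applies to plain text files amounts to something like the following sketch (the function name is hypothetical). Existing CR/LF pairs are normalized first so they are not doubled; lone CR characters are left alone.

```python
def fix_line_endings(data: bytes) -> bytes:
    """Convert Unix LF line endings to the CR/LF pairs DOS expects."""
    unix = data.replace(b"\r\n", b"\n")   # avoid turning CR/LF into CR/CR/LF
    return unix.replace(b"\n", b"\r\n")
```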
"/SITE" shows the name of any <A HREF=...> location in the output file. For
example, if a link goes to a specific Web page, the output file may include some
reference like [http://www.thex-files.com/upepis.htm/]. Initially defaults to
"/-SITE" (do not show the site name).
"/FSITE" is similar to /SITE except all of the references are shown as footnotes
instead of being left in the text itself. Initially defaults to "/-SITE".
"/-SITE" shows, at best, the symbolic reference if a link is provided on a page.
Instead of some [http://...] thing, you'll see (link) provided that /SYMBOLS are
turned on. Initially defaults to "/-SITE".
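The difference between the three site settings can be sketched as follows. The exact output layout (where the bracketed address goes, how footnotes are numbered and listed) is an assumption made for illustration, as are the names "render_links" and "A_TAG".

```python
import re

# Matches <A HREF=...>text</A>, with or without quotes around the address.
A_TAG = re.compile(r'<A\s+HREF="?([^">\s]+)"?[^>]*>(.*?)</A>', re.I | re.S)

def render_links(html, mode="-SITE", symbol="(link)"):
    """Rewrite links for /SITE, /FSITE, or /-SITE (the default)."""
    notes = []
    def repl(m):
        url, text = m.group(1), m.group(2)
        if mode == "SITE":
            return f"{text} [{url}]"             # address left in the text
        if mode == "FSITE":
            notes.append(url)
            return f"{text} [{len(notes)}]"      # numbered footnote marker
        return f"{text} {symbol}".rstrip()       # symbolic reference only
    body = A_TAG.sub(repl, html)
    for number, url in enumerate(notes, 1):
        body += f"\n[{number}] {url}"
    return body

print(render_links('See <A HREF="http://x.org/a.htm">this</A>.', "FSITE"))
```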
"/ALT" turns on the printing of the "Alt=" indicator in an <IMG...> statement.
These are sometimes created by the page designer for use on buttons for viewers
who don't have graphical support. Since text-only Web browsers are dying out,
this is probably a standard which won't continue forever but it can't hurt. If
/ALT is specified, these alternate texts show up independently of the /SYMBOLS
setting. Initially defaults to "/-ALT".
"/-ALT" prevents the Alt= text in <IMG...> statements from showing up. This is
initially the default.
"/SPACES" keeps the extra vertical spacing between sections. There are
frequently lots of extra blank lines that appear in the output file either due
to specific HTML requests or to ensure proper reformatting. Specifying /SPACES
allows these to stay there.
"/-SPACES" removes these extra blank lines. This is initially the default.
"/WARNINGS" displays warnings when HTMSTRIP finds either internal problems in
the document or things it can't handle. Initially defaults to "/-WARNINGS".
"/-WARNINGS" turns off the warning messages. This is initially the default.
"/RULE=s" specifies that a string is to be repeated the width of the line. This
is used to separate sections. The string can be a single character (like
"/RULE=-"), multiple characters (like "/RULE="- ""), it can contain decimal and
hexadecimal characters (like "/RULE=\066\097\116"--see BRUCEHEX.DOC), it can be
"/RULE=NULL" (which typically results in a blank line), or just simply "/RULE"
(which is the same thing as "/RULE=-" if /BORDER=T and "/RULE=\196" if /BORDER=S
or /BORDER=D). Personally, if your printer supports IBM graphics characters, I
find "/RULE=\196" to be the most pleasing of the rule lines.
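Repeating a rule string across the line is simple enough to sketch. This version assumes the result is cut off at exactly the line width when the string does not divide evenly, which the text above does not spell out; "rule_line" is a made-up name.

```python
def rule_line(s, width=80):
    """Repeat the rule string across the line; NULL (empty) gives a blank line."""
    if not s:
        return ""
    repeats = width // len(s) + 1      # one extra copy, then trim
    return (s * repeats)[:width]

print(rule_line("-", 20))
print(rule_line("- ", 20))
```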
"/BORDER=c" specifies the type of border to use. The possible choices for "c"
are "D" (double), "S" (single), "T" (text), "B" (blanks), or "N" (none).
/BORDER=B shows spaces instead of delimiters whereas /BORDER=N skips the blank
lines between cells entirely. Examples of the other three:
      <T>ext               <S>ingle              <D>ouble
+-----+-----+-----+   ┌─────┬─────┬─────┐   ╔═════╦═════╤═════╗
|  1  |  2  |  3  |   │     │     │     │   ║     ║     │     ║
+-----+-----+-----+   ├─────┼─────┼─────┤   ╠═════╬═════╪═════╣
|  4  |  5  |  6  |   │     │     │     │   ║     ║     │     ║
+-----+-----+-----+   ├─────┼─────┼─────┤   ╟─────╫─────┼─────╢
|  7  |  8  |  9  |   │     │     │     │   ║     ║     │     ║
+-----+-----+-----+   └─────┴─────┴─────┘   ╚═════╩═════╧═════╝
"/BUFF=n" specifies how many spaces to position on either side of the vertical
bars in the tables. Defaults to /BUFF=1.
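A text-style (/BORDER=T) table with a configurable buffer might be drawn like this. It is a sketch only: centering the cell contents and sizing each column to its widest cell are assumptions, and "text_table" is a hypothetical name.

```python
def text_table(rows, buff=1):
    """Draw rows of cell strings with +, -, and | borders."""
    ncols = max(len(r) for r in rows)
    cell = lambda r, c: r[c] if c < len(r) else ""     # short rows get blanks
    widths = [max(len(cell(r, c)) for r in rows) for c in range(ncols)]
    pad = " " * buff                                   # the /BUFF=n spacing
    sep = "+" + "+".join("-" * (w + 2 * buff) for w in widths) + "+"
    out = [sep]
    for r in rows:
        cells = [cell(r, c).center(widths[c]) for c in range(ncols)]
        out.append("|" + "|".join(pad + text + pad for text in cells) + "|")
        out.append(sep)
    return "\n".join(out)

print(text_table([["1", "2", "3"], ["4", "5", "6"]], buff=2))
```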
"/Iinitfile" says to read an initialization file with the file name "initfile".
The file specification *must* contain a period. If no drive or path information
is specified, the program will search for initfile beginning in your default
subdirectory and then going throughout your DOS path. The use of an
initialization file is optional. Initially defaults to "/IHTMSTRIP.INI".
"/-I" (or "/INULL") says to skip loading the initialization file.
"/Linitfile" says that the "&xxx;" and "<A>" etc lookup codes are found in a
file other than the default "/Iinitfile" file. This is primarily useful if
you want to have a master *.INI file and a separate code lookup table.
"/?" or "/HELP" or "HELP" shows you the syntax for the command.
"/?&H" gives you a hexadecimal and decimal conversion table.
Author:
This program was written by Bruce Guthrie of Wayne Software. It is free for use
and redistribution provided relevant documentation is kept with the program, no
changes are made to the program or documentation, and it is not bundled with
commercial programs or charged for separately. People who need to bundle it in
for-sale packages must pay a $50 registration fee to "Wayne Software" at the
following address.
Additional information about this and other Wayne Software programs can be found
in the file BRUCEymm.DOC which should be included in the original ZIP file.
("ymm" is replaced by the last digit of the year and the two digit month of the
release. BRUCE508.DOC came out in August 1995. This same naming convention is
used in naming the ZIP file that this program was included in.) Comments and
suggestions can also be sent to:
Bruce Guthrie
Wayne Software
113 Sheffield St.
Silver Spring, MD 20910
fax: (301) 588-8986
e-mail: bguthrie@nmaa.org
http://hjs.geol.uib.no/guthrie/
See BRUCEymm.DOC file for additional contact information.
Foreign users: Please provide an Internet e-mail address in all correspondence.